feat: add component contributor test harness by ArangoGutierrez · Pull Request #508 · NVIDIA/aicr

ArangoGutierrez · 2026-04-08T18:41:29Z

Summary

Validate AICR components end-to-end with a single command — no GPU hardware required for most components.

make component-test COMPONENT=cert-manager

- Three test tiers (auto-detected from registry.yaml): scheduling (KWOK redirect), deploy (Kind + bundle + health check), gpu-aware (Kind + nvml-mock + deploy + health check)
- nvml-mock integration using ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters (arm64 + amd64, includes nvidia-smi)
- Bundler bugfix: the deploy.sh template now includes the --version flag conditionally, which fixes broken helm commands for components without a defaultVersion in the registry (e.g., gpu-operator)
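The tier breakdown above can be read as a mapping from tier name to harness steps. The following is a minimal sketch of that mapping, not the repository's code: the step names reuse the script names listed in this PR, while the function name `steps_for_tier` and the `kwok-redirect` label are hypothetical.

```shell
# Hypothetical sketch: map a test tier to the harness steps it would run.
# Step names mirror the script names listed in this PR; the function and the
# "kwok-redirect" label are illustrative and not taken from the repository.
steps_for_tier() {
  case "$1" in
    scheduling)
      # No Kind cluster: this tier is handed off to the existing KWOK tooling.
      echo "kwok-redirect"
      ;;
    deploy)
      echo "ensure-cluster deploy-component run-health-check cleanup"
      ;;
    gpu-aware)
      # Same as deploy, plus the nvml-mock setup before the component is deployed.
      echo "ensure-cluster setup-gpu-mock deploy-component run-health-check cleanup"
      ;;
    *)
      echo "unknown tier: $1" >&2
      return 1
      ;;
  esac
}

# Print the plan for each tier.
for tier in scheduling deploy gpu-aware; do
  printf '%-10s -> %s\n' "$tier" "$(steps_for_tier "$tier")"
done
```

The only difference between the deploy and gpu-aware plans is the extra `setup-gpu-mock` step, which matches the tier descriptions in the summary.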

New files

- tools/component-test/: 6 scripts (detect-tier, ensure-cluster, setup-gpu-mock, deploy-component, run-health-check, cleanup), Kind config, nvml-mock manifest, README
- Makefile targets: component-test, component-detect, component-cluster, component-deploy, component-health, component-cleanup
- Documentation updates in DEVELOPMENT.md and CONTRIBUTING.md

Test Plan

- make test: all unit tests pass (72.1% coverage)
- make component-test COMPONENT=cert-manager: deploy tier end-to-end (build → deploy → health check → cleanup)
- make component-test COMPONENT=gpu-operator TIER=gpu-aware: gpu-aware tier end-to-end (build → nvml-mock → deploy → health check → cleanup)
- make component-test COMPONENT=cert-manager TIER=scheduling: scheduling tier redirects to KWOK
- New tests: TestGenerateDeployScript_EmptyVersionOmitsFlag, TestGenerateDeployScript_WithVersionIncludesFlag

kannon92 · 2026-04-08T18:50:02Z

So rather than go with mock GPUs is there a way we could have a CPU flavor?

I like that pattern for llama.cpp or vllm.

ArangoGutierrez · 2026-04-08T18:55:34Z

> So rather than go with mock GPUs is there a way we could have a CPU flavor?
>
> I like that pattern for llama.cpp or vllm.

Good question — the harness actually already has a GPU-free path. The deploy tier validates components in plain Kind without any GPU mock (cert-manager, kai-scheduler, etc. use this today).

The nvml-mock layer is specifically for components that gate on GPU presence during init (gpu-operator, nvidia-device-plugin, DRA driver): they won't even start their reconciliation loop unless they detect NVML libraries and device nodes on the host. There's no CPU flavor of those because their entire purpose is managing GPU hardware.
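To make "gating on GPU presence" concrete, here is a rough stand-in for that kind of check. It is purely illustrative: the real components do their own detection, and the function name `gpu_stack_present`, the candidate library directories, and the optional root prefix are assumptions made for this sketch. The only fixed points are the conventional NVML library name (`libnvidia-ml.so`) and the `/dev/nvidia0` device node.

```shell
# Illustrative only: a rough stand-in for the presence check described above.
# Real components perform their own detection; the function name, the list of
# candidate directories, and the root prefix are assumptions for this sketch.
gpu_stack_present() {
  root="${1:-}"   # optional prefix so the check can be pointed at a fake tree
  found_lib=0
  for dir in usr/lib usr/lib64 usr/lib/x86_64-linux-gnu usr/lib/aarch64-linux-gnu; do
    for f in "$root/$dir"/libnvidia-ml.so*; do
      # An unmatched glob stays literal, so confirm the file really exists.
      if [ -e "$f" ]; then
        found_lib=1
      fi
    done
  done
  # Require both the NVML library and at least the first GPU device node.
  [ "$found_lib" -eq 1 ] && [ -e "$root/dev/nvidia0" ]
}

if gpu_stack_present; then
  echo "NVML library and device node found"
else
  echo "no GPU stack detected"
fi
```

On a plain Kind node without any mock, a check like this fails, which is the situation the nvml-mock layer is meant to address.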

For inference workloads like llama.cpp or vLLM, a CPU flavor would make sense as a complementary pattern: deploy the serving stack with a CPU backend and validate the end-to-end request path. That's a higher-level integration test than what this harness targets (component deployment + health check), but it could be built on top of it.

So both patterns have a place:

- nvml-mock: GPU infrastructure components that check for hardware at init
- CPU flavors: inference/serving workloads that can run with CPU backends

kannon92

Thanks for this! This should help me a lot with Kueue work.

ArangoGutierrez · 2026-04-08T20:08:04Z

CI is passing, ready for review @yuanchen8911 / @mchmarny

yuanchen8911 · 2026-04-08T23:36:35Z

Cross-Review Summary for PR #508

Reviewers: Claude Code, Codex, CodeRabbit + Integration Analysis
Rounds: 1 + Codex follow-up
Consensus reached: Yes

Confirmed Issues

| # | Severity | Finding | Confirmed By |
|---|----------|---------|--------------|
| 1 | Low | `cleanup.sh` interactive `read` prompt blocks without a TTY: when `DELETE_CLUSTER=true` and `FORCE_CLEANUP` is unset, `cleanup.sh#L95-L109` calls `read -r -p`, which hangs in non-interactive environments. Not on the main `component-test` happy path, but the README documents `make component-cleanup DELETE_CLUSTER=true` as a supported command. | Codex + CodeRabbit |
| 2 | Low | Scheduling tier silently succeeds without testing: `make component-test COMPONENT=<scheduling-component>` exits 0 via `Makefile#L602-L619` and `ensure-cluster.sh#L46-L55` after printing guidance. This conflicts with the README's promise that the harness "auto-detects the right test tier, creates a Kind cluster, deploys the component, and runs its health check." | Codex + CodeRabbit |

Cross-review by Claude Code + Codex + CodeRabbit

yuanchen8911

left some comments.

yuanchen8911

There are two issues. Both are low severity, but they affect correctness of the new contributor workflow, so I would suggest changes.

yuanchen8911

LGTM — both review issues are addressed in 481035d. Needs a rebase onto main and re-approval from @kannon92

Validate AICR components end-to-end with a single command:

    make component-test COMPONENT=cert-manager

Three test tiers, auto-detected from registry.yaml:
- scheduling: redirects to existing KWOK infrastructure
- deploy: Kind cluster + aicr bundle + chainsaw health check
- gpu-aware: Kind + nvml-mock DaemonSet + deploy + health check

New files:
- tools/component-test/{detect-tier,ensure-cluster,setup-gpu-mock,deploy-component,run-health-check,cleanup}.sh
- tools/component-test/{kind-config.yaml,manifests/nvml-mock.yaml,README.md}

Makefile targets: component-test, component-detect, component-cluster, component-deploy, component-health, component-cleanup.

Uses ghcr.io/nvidia/nvml-mock:0.1.0 for GPU simulation in Kind clusters (arm64+amd64, includes nvidia-smi).

Tested end-to-end:
- deploy tier: cert-manager (build → deploy → health check → cleanup)
- gpu-aware tier: gpu-operator (build → nvml-mock → deploy → health check → cleanup)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The deploy.sh template unconditionally included '--version {{ .Version }}', which produced a broken helm command when Version was empty (e.g., gpu-operator has no defaultVersion in registry.yaml). Helm 4 treats the empty --version as a missing required argument.

The template now conditionally includes --version only when Version is non-empty, allowing components without pinned versions to install the latest chart from the repository.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
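The fix itself lives in the bundler's template, which this page does not show. As a shell analogue of the same idea, the sketch below only appends `--version` when a version string is present. The function name `helm_install_cmd`, the use of `helm upgrade --install`, the chart references, and the `v1.2.3` placeholder are all assumptions for illustration; the actual rendered script may differ.

```shell
# Hypothetical shell analogue of the template fix: emit --version only when a
# version is set. The function name, the chart references, and the v1.2.3
# placeholder are illustrative; the real deploy.sh may use a different command.
helm_install_cmd() {
  release="$1"
  chart="$2"
  version="${3:-}"
  # ${version:+...} expands to nothing when version is empty or unset,
  # so an unpinned component never produces a dangling "--version".
  echo "helm upgrade --install $release $chart${version:+ --version $version}"
}

helm_install_cmd cert-manager jetstack/cert-manager v1.2.3
# prints: helm upgrade --install cert-manager jetstack/cert-manager --version v1.2.3

helm_install_cmd gpu-operator nvidia/gpu-operator
# prints: helm upgrade --install gpu-operator nvidia/gpu-operator
```

The two calls correspond to the two new test names in the PR: one case where the version is set and the flag appears, and one where the version is empty and the flag is omitted.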

- cleanup.sh: Detect non-interactive mode (no TTY) and fail with a clear error instead of hanging on 'read' when DELETE_CLUSTER=true without FORCE_CLEANUP=true.
- Makefile: Scheduling tier now exits with code 2 instead of 0 to signal that no test was executed, with guidance to use make kwok-e2e.
- README: Clarify that scheduling tier redirects to KWOK and does not create a Kind cluster.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
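The two behavioral changes in this commit can be sketched as follows. This is not the code from cleanup.sh or the Makefile: `FORCE_CLEANUP` and `make kwok-e2e` come from the PR, while the function names, messages, and prompt wording are hypothetical.

```shell
# Hypothetical sketch of the two behaviors described in this commit.
# FORCE_CLEANUP and "make kwok-e2e" come from the PR; the function names,
# messages, and prompt text are illustrative.

# Intended to be called only when cluster deletion was requested.
confirm_cluster_delete() {
  if [ "${FORCE_CLEANUP:-false}" = "true" ]; then
    return 0   # explicit opt-in, no prompt needed
  fi
  if [ ! -t 0 ]; then
    # stdin is not a terminal: a prompt would hang, so fail loudly instead.
    echo "error: no TTY; set FORCE_CLEANUP=true to delete the cluster non-interactively" >&2
    return 1
  fi
  printf 'Delete the Kind cluster? [y/N] '
  read -r answer
  [ "$answer" = "y" ] || [ "$answer" = "Y" ]
}

# Signals "no test was executed" with an exit status distinct from
# 0 (passed) and 1 (failed).
scheduling_tier_notice() {
  echo "scheduling tier: no test was executed; run 'make kwok-e2e' instead" >&2
  return 2
}

# Non-interactive demonstration: stdin is redirected so nothing can block.
if ( FORCE_CLEANUP=true; confirm_cluster_delete </dev/null ); then
  echo "forced cleanup: confirmed without prompting"
fi
if ! ( unset FORCE_CLEANUP; confirm_cluster_delete </dev/null 2>/dev/null ); then
  echo "no TTY and no FORCE_CLEANUP: refused instead of hanging"
fi
```

Checking `[ -t 0 ]` before prompting is what lets a CI job fail fast with an actionable message rather than waiting on input that will never arrive.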

ArangoGutierrez · 2026-04-09T17:32:06Z

PTAL @kannon92 / @mchmarny

kannon92 · 2026-04-09T19:14:50Z

I'm not an approver here, but the last time I looked the PR looked good to me.

ArangoGutierrez requested review from a team as code owners April 8, 2026 18:41

github-actions bot added area/recipes area/docs area/bundler size/XL labels Apr 8, 2026

ArangoGutierrez mentioned this pull request Apr 8, 2026

add kueue components as an option #490

Merged

25 tasks

mchmarny assigned ArangoGutierrez Apr 8, 2026

ArangoGutierrez force-pushed the feature/component-test-harness branch from d84bc0a to 45ddbbe Compare April 8, 2026 19:08

kannon92 reviewed Apr 8, 2026

yuanchen8911 requested changes Apr 9, 2026

yuanchen8911 self-requested a review April 9, 2026 00:29

yuanchen8911 reviewed Apr 9, 2026

ArangoGutierrez force-pushed the feature/component-test-harness branch from 2844c50 to 481035d Compare April 9, 2026 06:20

ArangoGutierrez requested a review from yuanchen8911 April 9, 2026 06:21

yuanchen8911 reviewed Apr 9, 2026

ArangoGutierrez added 3 commits April 9, 2026 19:22

ArangoGutierrez force-pushed the feature/component-test-harness branch from 481035d to 4584c25 Compare April 9, 2026 17:23

ArangoGutierrez requested a review from kannon92 April 9, 2026 17:31

ArangoGutierrez wants to merge 3 commits into NVIDIA:main from ArangoGutierrez:feature/component-test-harness

3 participants
